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SUBSPACE ESTIMATION AND PREDICTION METHODS FOR 
HIDDEN MARKOV MODELS 

By Sofia Andersson and Tobias Ryden 1 

AstraZeneca R&D and Lund University 

Hidden Markov models (HMMs) are probabilistic functions of fi- 
nite Markov chains, or, put in other words, state space models with 
finite state space. In this paper, we examine subspace estimation 
methods for HMMs whose output lies a finite set as well. In particu- 
lar, we study the geometric structure arising from the nonminimality 
of the linear state space representation of HMMs, and consistency of 
a subspace algorithm arising from a certain factorization of the sin- 
gular value decomposition of the estimated linear prediction matrix. 
For this algorithm, we show that the estimates of the transition and 
emission probability matrices are consistent up to a similarity trans- 
formation, and that the m-step linear predictor computed from the 
estimated system matrices is consistent, i.e., converges to the true 
optimal linear m-step predictor. 

1. Introduction. A hidden Markov model (HMM) is a time series model 
for which the observable output is governed by a nonobservable finite-state 
Markov chain. HMMs have been applied in a wide range of areas, including 
telecommunications, speech recognition, bioinformatics and econometrics. 
See the monographs [8, 12, 19, 22] as well as the somewhat old but still 
informative tutorial [28] for further reading and references. 

Estimation of HMMs is, in most cases, carried out using maximum like- 
lihood (ML), and it is known that the MLE enjoys the standard favorable 
statistical properties: consistency, asymptotical normality and asymptotic 
efficiency [5, 7, 21]. However, from a computational point of view, the MLE 
is less tractable; the log-likelihood function is highly nonlinear and typically 
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severely multi-modal. Thus, iterative algorithms like the EM algorithm, of- 
ten referred to as the Baum- Welch algorithm in the context of HMMs [6], 
are needed to maximize the log-likelihood, and there is no guarantee that 
such algorithms will find the actual global maximum — the MLE. 

Subspace methods have proved very successful for estimating linear Gaus- 
sian state-space models. These methods are noniterative and based on ma- 
trix computations that are numerically stable. An HMM can be formulated 
as a linear state-space model, and in [16], subspace estimation was applied to 
a simple HMM; no further theoretical basis for the validity of the procedure 
was provided, but encouraging numerical results were reported. The purpose 
of the present paper is to give a theoretical foundation for the application of 
subspace methods to HMMs. Our main results are two consistency theorems. 
The first result states that the estimated transition and emission probability 
matrices are consistent, up to a similarity transformation (change of basis). 
The second result states that the optimal linear m-step predictor computed 
from the estimated matrices, is consistent in the sense that it converges to 
the corresponding optimal predictor given the true, unknown, system ma- 
trices. This second result can also be viewed as HMMs, estimated with sub- 
space methods, being a general tool for very quickly (complexity discussed 
in next paragraph) building predictors for discrete time series (or, more pre- 
cisely, time series taking values in a finite set). In this respect, the subspace 
approach is analogous to building predictors with finite-dimensional state- 
space models (or, equivalently, ARMA models) for real-valued time series. 

Difficulties remain however, in particular the so-called positive realization 
problem, which here amounts to determining a representation of an HMM 
second-moment function in terms of transition probability matrices which 
are positive and with elements between zero and one. Significant progress 
has been made on this subject [1, 30], and there exist locally convergent 
algorithms to find positive realizations for related problems (cf. Section 9), 
but it is fair to say that the problem is not settled. Another drawback of 
subspace methods is that they do not allow for including structural con- 
straints on the transition and/or emission probability matrix. For instance, 
it may be that some elements of these matrices are identically zero (by 
modeling assumptions), or that several probabilities are jointly given by a 
common lower-dimensional parameter. Such structures are usually easily in- 
corporated by, for example, the EM algorithm, but cannot be handled with 
subspace methods. See Section 9 for a remark on zero elements. The advan- 
tage of subspace methods is rather on the computational side. With ML, 
one evaluation of the log-likelihood (amounting to running the forward pass 
of the forward-backward algorithm) , or one iteration of EM (which is domi- 
nated by the complexity of the E-step) has complexity 0(Tn 2 ) with T being 
the number of observations and n being the cardinality of the state space 
([8], Section 5.1.1) and many evaluations or iterations need to be carried out. 
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A subspace estimation algorithm first builds covariance matrices of size (M) 2 
for some k from T observations, where I is the cardinality of the set of obser- 
vations, and then computes the singular value decomposition of a matrix of 
the same size; the complexity is thus 0{T{klf) + 0{{klf) (cf. [13], Section 
5.4.5). Typically, k is chosen of the order logT, and the complexity is then 
slightly higher than linear in T. In practice, a subspace method is much 
faster than a likelihood-based method however, because it is noniterative 
and all steps are carried out just once. This difference becomes particularly 
pronounced when n and/or £ is large, since likelihood-based methods then 
in general suffer the most from multi-modal log-likelihood surfaces, require 
many iterations to converge, and are prone to converge to local maxima. 

We do not give a general introduction to subspace methods in this pa- 
per, but refer the reader to, for example, [18, 26]. A problem we do not 
consider is order estimation, that is, estimating the number of states of the 
hidden Markov chain. It has been suggested in [27], Chapter 8, to use the 
singular value decomposition of the subspace algorithm for this purpose, 
by estimating the number of nonzero singular values. It is, however, not 
obvious that this approach immediately carries over from linear Gaussian 
models to HMMs. In the present paper, we leave this issue alone assuming 
that the model order is known beforehand or estimated otherwise. Another 
topic that we will not elaborate on is the choice of truncation index, or time 
horizon, k, of the subspace algorithm. Although our consistency result gives 
guidelines on how large, asymptotically, k is allowed to be, it gives no infor- 
mation on what a good choice for k for a finite set of data would be. It has 
been proposed to select k as to minimize certain criteria — see [4], Chapter 
5, and [27], Chapter 8 — but again we do not consider this problem further. 

The paper is organized as follows. Section 2 sets the notation for the 
HMM and some matrices used by the algorithm, and Section 3 describes 
various state-space representations of the model. Section 4 discusses linear 
prediction of future observations from past ones, while Section 5 summarizes 
the algorithm. Section 6 explores structural properties of the system related 
to its nonminimality, and Section 7 contains consistency results. The paper 
is concluded with a small simulation study in Section 8, and a discussion of 
the theoretical results in Section 9. 

2. Notation. We consider an HMM with finite state space and observa- 
tions in a finite alphabet. More specifically, the model comprises a nonob- 
servable (hidden) Markov chain {x^ I }'^_ 00 (the state process) and an ob- 
servable discrete-time process {yt}'^ = _ 00 such that (i) given {x^ 4 }, {yt} is a 
sequence of conditionally independent random variables, and (ii) the con- 
ditional distribution of yt+i depends on only. The state space will be 
denoted by {ei,...,e n } (its size is n) and the output alphabet will be de- 
noted by {d\, . . . ,de} (its size is €). Assuming that {xf 1 } is stationary, the 
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model is completely characterized by the nxn transition probability matrix 
A and the I x n emission probability matrix C, with elements 

aij = F(xfl ± = e.i\xf = ej), Cjj = F(y t +i = <k\xf = ej); 

here P denotes probability. This notation is somewhat nonstandard in that 
usually aij is the probability of moving from ej to ej (but here the oppo- 
site) and xf 1 governs the conditional distribution of yt (but here yt+i)- As 
will become clear below, however, the present notation leads to convenient 
expressions in what follows. 

Without loss of generality, we will let ej be the ith coordinate (column) 
vector in R n . ej E R n thus consists of zeros, except for its iih. element which 
is unity. Similarly, <ij will be taken as the ith unit vector in R . In fact, 
this choice is crucial in order to represent the HMM as a linear state-space 
model. Both A and C have entries in [0, 1] and column sums equal to unity; 
we say that they are column stochastic. These constraints can be written 
ljj4 = lj and lJC = lj, where 1„ is a length n column vector of ones and 
superindex T denotes matrix transposition. Also note that ljx^ = 1 and 
ljyt = 1- We will write S n for the linear subspace of R n spanned by l n , and 
S^; for its orthogonal complement. We will write E for expectation, and l n 
(as above) and n for a length n column vector of ones or zeros, respectively. 
The following assumption will be imposed. 

Condition A. A is irreducible and aperiodic (i.e., ergodic). 

Under this assumption, A admits a unique stationary distribution ir satis- 
fying Air = 7T. Note that tt = K[x^] and, similarly, E[yJ = Cir. Moreover, A 
has one eigenvalue on the complex unit circle (namely 1), while its remaining 
eigenvalues lie inside it. This implies that A 1 — ► 7rl„ as t — > oo. 

3. State-space representations. In order to apply subspace methods, we 
need to formulate the HMM in state-space form. In [12], Chapter 2, it was 
shown that the model may be written as 

(3.1) xf^=Axf + 6+1, 

(3.2) y t+1 = Cxf + r) t+ i, 

where {£t} and {rjt} are independent sequences of martingale increments. 
Hence, both processes contain uncorrelated elements, but neither process 
has elements that are independent or Gaussian. Iterating (3.1) leads to xf 1 = 
J2o^ -A?£t-j + where the sum is well defined because lj£t = 0. This system 
also admits a prediction error (or, innovation) representation [3] , 

(3.3) x t+ i = Ax t + Ke t+ i , 



(3.4) 



y t +i = Cxt + et+i- 
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Here, x% is the optimal linear predictor of xf given all past observable 
information {y s } t s= _ 00 . Note that ljxt = I. Moreover, e t is the one-step- 
ahead linear prediction error of yt given {ysj^Z 1 .^. Hence, Eet = and 
{et} is an uncorrelated sequence. The n x i matrix K, the Kalman gain, 
solves a "singular" Riccati equation and satisfies l^K = 0. Finally, define 
the centered processes xt = xt — vr and yf 1 = y^ 1 — Cir, both of which have 
zero mean, as well as the matrix A = A — irl^. Subtraction by irl^ moves 
the eigenvalue 1 of A to 0, but otherwise leaves the eigenvalues and -vectors 
of A alone. Thus, A is a stable matrix, that is, its spectral radius is less than 
one. Using Air = ir and lj t xt = 0, we find 

(3.5) x t+ i = Ax t + Ke t+1 = Ax t + Ke t+ i , 

(3.6) y t+1 = Cx t + e t+1 . 

Since l~t x t = 1] t ne ^-process lives in an (n — l)-dimensional affine sub- 
space of R n . A major idea of the present paper is to restructure this process 
into one component x living in the (n — l)-dimensional linear space S^, and 
one component living in the null-dimensional affine space {1}. Thereby, we 
need to estimate the (n — l)-dimensional component in the linear space only, 
whereas the remaining affine component is for free. 

We now introduce the notation 

vt = (yJ+i ,yJ+2,---) T , yt 0) = ivt+i . yt+2 . • ■ • . yt+k) T , 
yt = (% T >%-i>---) T > vt( k ) = (yt ,yt-n---iyt-k+i) T ■ 

These quantities are all column vectors. Vectors ef , e~j~ , etc., are defined in 
a corresponding manner, as are the vectors yf, , etc., of centred observa- 
tions (subtracting Cir). Further, we define the covariance matrices 

H = nvtyt T i n k = E[y+(k)yt(k) T ], 

r = nvtm T ], r fc = E[y t -(fc)yr (fc) T ]. 

Here, Tik contains the first k (size I) block rows and block columns of Ti., 
and similarly for T^. The entries of these matrices can be expressed using 

E[y t yJ +j ] = CA^o) SA tivo)T C T + 5 . qR = CA^^ SA Uv0)T C T + 8 j0 R, 

where 8jo is Kronecker's delta and 

S = E[{xf - 7r)(xf - vr) T ] = diag(vr) - irir T , 

R = E[r) t r)t] = diag(Cvr) - Cdiag(7r)C T . 

The equalities above follow from (3.1) and (3.2); alternatively, the second 
equality can be derived by observing (e.g., using induction) that A 7 = A + 
7rl^ for any j > 0, and that 1^5 = (Sl n ) T = 0. 
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4. Projecting future observations on past ones. Equations (3.5) and 
(3.6) yield 

m—l 

(4.1) y t+m = CT l ~ X x t + ^ CA , ~ 1 Ke t+m -j + e t+m . 

3=1 

Defining O = (C T ~jC C T (A 2 ) T C T • • • ) T , we can write this in matrix 
form as yf = Oxt + Msf for some matrix A4 . Now, write for the Hilbert 
space spanned by the elements of y~[ and "P-- for the projection operator 
onto this space. Since Xt G by definition and ef is orthogonal to y^ , we 

obtain V--y^~ = Oxt- Furthermore, it follows from [3] [see (3.5) and (3.6)] 
that 

oo 

(4.2) x t = J2(A - KCyKy t _j = Ky>, 

j=0 

where K = (K, (A - KC)K, {A - KC) 2 K, ...). Hence, V--yf = OKy^ = 

f3y^ , with, obviously, (3 = OK. Thus, (3 is the matrix that describes the 
optimal linear prediction of yf from y~[ . We note that (3 = TiT^ , where su- 
perscript f denotes the Moore-Penrose pseudo-inverse; that is well defined 
is proved in Section 6. 

Now, assume that we want to predict in a finite horizon context, that 
is, we want to predict linearly yf{k) from y~j~ (k) for some k. The optimal 
weight matrix (3k is then (3^ = 7ik^ k - It is indeed necessary to use this gen- 
eralized inverse since and other matrices are singular as a result of the 
nonminimality of the system, to be discussed in detail in Section 6. Note 
that (3k is not an approximation of (3, as the latter matrix is infinite whereas 
Pk is kt x k£, but we will show below that the predictions obtained using 
these two matrices become identical as k — > oo. 

The following assumption ensures that the matrix (3 has maximal rank, 
apart from a rank loss of one due to the nonminimality of the system. 

Condition B. The matrix O has rank n, and the matrix /C has rank 
n — 1. 

This condition thus states that O is full rank, and that the relation l^K = 
is the only rank deficiency of JC. It ensures that (3 = OK, has rank n — 1. In 
terms of linear systems theory, the first part of the assumption is equivalent 
to saying that the observability matrix formed by the pair (A, C), that is, the 
n£xn matrix formed by the first n block rows CA l ~ l , has rank n - 1 (cf . [29] , 
Theorem 9.11); this is a straightforward application of the Cay ley-Hamilton 
theorem. The second part amounts to saying that the controllability matrix 



SUBSPACE ESTIMATION METHODS FOR HMMS 



7 



formed by the pair (A — KC,K), that is, the n x ni matrix formed by the 
first n block columns (A — KC) l ~ 1 K of /C, has rank n — 1 (cf. [29], Theorem 
9.5). This is equivalent to assuming that the controllability matrix formed 
by the pair (A,K) has rank n—\ (see [29], Theorem 18.16). 

5. The algorithm. Assume that we have observed {yt)J=\ and define, for 
some k > 1, the data Hankel matrices 



(yi yi ■ ■ ■ vt \ / yo yi ■■■ yr-i \ 



2/2 J/3 ••• yr+i 



Y~ = 

\yk yk+i ■■■ yr+k) \y-k+i y-k+2 ■■■ yr-kJ 



y-i yo ■■■ yr-2 



where entries y s with s < or s > T are set to di for arbitrary i. Empirically 
centred variables are defined by letting my+ and my- be the row-wise means 
of Y + and Y ~ , respectively, and putting Y + = Y + — m Y + and Y = 

Y~ — my- 1 J- Then 7ik = T~ 1 Y + Y andrfc = T _1 y Y are estimates 
of Tfc and Hk, respectively, and the standard least-squares estimate of (3k is 

p k = (F + F" T )(F~F~ T )t = H k r{. 

Before proceeding, we need to reduce the rank of (3k, which is k{£ — 1), 
to that of the true (3, which is n — 1. This is typically done by means of a 
singular value decomposition (SVD). Let 

(3 k = UAV T = (U U U 2 ) ( X n £ ) (Vi, V 2 ) T = Ui^uV? + U 2 l 22 V 2 T , 

where An and K 22 are (n — 1) X (n — 1) and (ki — n + 1) x (kl — n + 1) 
matrices, respectively be the SVD of We define estimates Ok and of 
size ki x [n — 1) and (n — 1) x ki, respectively to satisfy O^Kk = UiAuV^ , 
and our particular factorization of this equality will be 

6 k = u 1 , £ fc = A n i> T . 

Several other factorizations are more common in the literature, but this one 
is relatively simple to analyze as will be illustrated below. Our framework 
is however applicable to the analysis of other methods as well, such as for 
instance the canonical variate analysis (CVA) method (see [10]). Note that 
(3k is an estimate before model order reduction and that OkKk is an estimate 
after reduction. Thus, while (3 = OIC, (3k and OkKk are not equal. 

Another aspect of this factorization is that while the product OkKk ap- 
proximates a finite block of the infinite matrix O/C, the factors Ok and K-k 
do not approximate O and /C, respectively in the same sense. As a result, 
the estimated states Tt below differ from the xt by a change of basis, and 
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the estimated system matrices A, etc., will differ from their true values by 
the same change of basis. Thus, the matrix A will in general not have values 
in [0,1], or column sums being unity; to find such a representation of the 
system requires computing a corresponding similarity transformation. An 
alternative way to view this problem is by noting that in (3.3) and (3.4), we 
can multiply the states xt and innovations £t by any nonsingular matrix S, 
without changing the system if the matrices A, C and K are transformed 
accordingly. Predictive distributions computed using the estimated system 
matrices are consistent in the usual sense, however, without a change of 
basis. _ 

Returning to the algorithm, an estimate X = (xq, . . . ,xt-i) of the unob- 
served centred (predicted) state sequence (xo, X2, ■ ■ ■ ,%t~i) is constructed 
as X = )Ck(Y~ — (lfc <8> my)lj) [cf. (4.2)] where <8> denotes Kronecker prod- 
uct and my = T _1 Yli Vt- Subtracting a common mean (of all observed yt) 
makes the estimator slightly easier to analyze. Approximating x~t by xt has 
two sides: first K, is replaced by a matrix /C& comprising the first k blocks of 
/C, and then is replaced by an estimate. The estimate x is expressed in 
a different basis than is x; note, however, that x and x lie in linear spaces 
R™" 1 and S„, respectively, of equal dimensionality. An estimate of Xt is 

obtained by adjoining the affine component 1, that is, we set x% = \x t ,1] T 
and define X as the corresponding n x T matrix. Again, xt differs from xt 
by a change of basis. 

An estimate X\ of (x\, X2, • • • , #t) is obtained by dropping the first column 
of X and then replicating its last column once. Our approximation of the 
system (3.3)-(3.4), up to a change of basis, is then 

Xi = AX + KS, Y + = CX + £. 

Estimating A, C and K from these equations using linear regression gives 

A = X l X T {XX T )\ C = Y + X T {XX T )\ K = X 1 £ T {££ T )\ 

where £ is the residual vector £ = Y + — CX\ note that the regressions to 
find A and K, respectively can be done separately since £ is, by its definition 
as residuals, uncorrelated with X (cf. [10], page 1867). 

We remark that as the nth component of xt equals 1, it is clear that the 
estimated regression coefficients for this component, that is, the nth row of 
A, will be [(£_!, 1]. Thus, [0j_ 1; 1] is a left eigenvector of A with eigenvalue 
1, so that A, like A, has one eigenvalue that is 1. 

6. Structural properties. This section explores some structural proper- 
ties of the state-space form of the HMM. The crucial difference between a 
"standard" state-space model and the model (3.1)-(3.2) is that the latter 
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is not minimal. As noted above, x^ 1 lies in an (n — 1) -dimensional affine 
subspace of R n and, likewise, yt lies in an (£ — l)-dimensional affine sub- 
space. In addition, A has a nonstable eigenvalue 1. Recall that K satisfies 
l^if = 0; also, A — KC has a single eigenvalue 1 on the unit circle and its 
other eigenvalues inside the unit circle [3]. Obviously, X^{A — KC) = 1^, 
so that lj is the left eigenvector corresponding to the eigenvalue 1. Let- 
ting 7 be the corresponding right eigenvector, normalized so that 1J7 = 1, 
that is, (A — KC)j = 7, we can define the matrix J = (A — KC) — 7IJ. 
This matrix plays a role similar to that of A. In particular, J is stable and 
(A - KC) j = J j + 7IX for any j > 0. Thus, the blocks (A - KC) j K that 
build K. equal J J K, and hence decay geometrically fast as j — > 00. 

Now define an n x (n — 1) matrix U n = [u±, 112, ■ ■ ■ , u n -i] with orthonormal 
columns that span S^. The vector n~ 1 / 2 l n has unit length and so Ilg^ = 
n _1 l n l^ is the matrix projecting a vector in R" on S n and II S ± = I n — 
n _1 l n lj is the matrix projecting on S^;. Alternatively, Il s x = U n llJ . The 
subspace is isomorphic to R n_1 , and this isomorphism is represented by 
the mappings U n : R n_1 -» and Uj:S^-> R n_1 . 

If x G S„ , then l^Ax = l^x = 0, that is, Ax G S^. Thus, A is a linear 
map on Likewise, C is a linear map from S^" to Sj~ and the covariance 
matrices S and R are linear operators on and Sj~, respectively, since 
lJS = and ljR = 0. The matrices A = AU n = U^AU n , S = U^SU n , 
R = Uj RUi and C = Uj CU n are the corresponding linear mappings oper- 
ating between the isomorphic spaces R n_1 and R^ _1 . In particular, A is an 
(n — 1) x (n — 1) stable matrix. Define for any k > 1 the hi x k(£ — 1) block 
diagonal matrix 



Oexe-i Ue ... Oexe-i 



\0exe-i Oexe-i ••• Ue J 



with k diagonal blocks, and = uf^ T T k U^\ Then is the matrix of 
covariances generated by a system like (3.1) and (3.2), but with system 
matrices (A,C) and covariance matrices (S,R). Indeed, using that is 
invariant under both S and A and that 5TI S ± = S, one finds that for j > i 

n 

say, 

CA ( j- i ^°S(A T f-^ vo C T + SijR = UjFijUe, 
where Tij is block (i, j) of T. Likewise, one can show that T k = uf ] T k uf ]T . 



Condition C. The eigenvalue of the covariance matrix R is simple. 
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This condition implies that T k is nonsingular for all k. It also implies that 
the spectral density matrix f(u>) of the system (A,C;S,R), given by 

J{u) = M f; e- i ^ t CA^ t ^S(A T )^C T + <5 0t i?|, 

\t=— oo ) 

has^eigenvalues bounded away from zero. Since A is stable the eigenvalues 
of f(uj) are bounded away from infinity as well. These observations allow 
us to make the stronger conclusion that the eigenvalues of T k are bounded 
away from zero and infinity uniformly in k (cf. [15], pages 265-266). In fact, 
the same holds for the infinite-dimensional matrix V, which is thus also 
invertible. Now, put rj. = Y^lff . Using UjUi = one readily 

verifies that (i) r fc rj.r fc = T k , (ii) rj,r fc rj. = rj., (iii) r fc rj. is symmetric and 

(iv) rjXfc is symmetric (the latter two matrices are both n[?]_). This implies, 

t 

as the notation suggests, that Tl is the Moore-Penrose inverse of T k [25], 

pages 167-168. Similarly, r+ = ul° o] f- 1 u\ oo]T . 

A slightly different way to view the same properties is to note that for any 
vector «6Sj, v T T k = 0. Thus, zero is an eigenvalue of F k with multiplicity 
A;. The remaining eigenvalues agree with those of T k , and the assumption 
above guarantees that none of them is zero. Since ljyt = 0, we also find that 
v T Y = 0, and hence v T T k = for any dgS*. Thus, zero is an eigenvalue 
with multiplicity k of T k as well, and its remaining eigenvalues are those 
of T k = uf^T^U^. These properties imply ||r^|| = 1 1 T^. 1 1 , 1 1 T^. 1 1 = HT^H and 

||r& — = \\Y k — Y k \\. Moreover, r|, = U^T k uf^ T if Y k is nonsingular. 

7. Consistency. In this section, we give a consistency result for the pa- 
rameter estimates of Section 5. The outline of the proof to a large extent 
follows that of [10]. First, in Section 7.1, we establish the rate of convergence 
of the sample covariances of the centred process. Section 7.2 contains results 
on consistency of the linear prediction matrix estimates, which are needed to 
carry out the proof in Section 7.3 of consistency of the subspace algorithm. 
Throughout this section, we assume that Conditions A-C hold. 

7.1. Convergence rates of sample covariances. The following result es- 
tablishes convergence rates of sample covariances, uniformly over some range 
of lags. We will write Zt = o(gt), if zt/gt —> a.s. as t — > oo. 

Theorem 1. For any 5 > and k T < T a , with a = r/(2(r - 2)) and 
r>4, 

T 

0(Qt) as T — > oo, 



(7.1) max 



■^J2ytyt+s-HvoyJ'. 
t=i 
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With Q T = T^ 1 /2( fcTlogT )2A(l gl g T )2(l+5)/r_ 

Proof. Put xf 1 = x t — ir. Subtracting ir and Ctt from (3.1) and (3.2) 
yields 

xf +1 = Axf + 6+i = Mf + 6+i , Vt+i = Cxf + r} t+1 . 

Using this representation, the proof proceeds essentially as that of Theo- 
rem 1 in [17]. The main difference is that the process S T (j,k,t) appearing 
therein is not always a martingale, which in turn invalidates the application 
of Doob's and Burkholder's inequalities. Instead, one must rely on corre- 
sponding inequalities for strongly mixing processes (Theorem 2 of [11], page 
26). A complete proof can be found in a preprint version of the present paper 
[2]. □ 

7.2. Convergence of moment matrix estimates. Theorem 1 has immedi- 
ate applications to the convergence of the moment matrix estimates. 

Proposition 1. Under the assumptions of Theorem 1, and provided 
k = kj- is such that A;0(Qt) — ► 0, it holds that \\(3k — Pk\\ = ^0{Qt)- 

Proof. We shall first prove that \\7ik — TCk\\ and \\Tk — Tk\\ are both 
kO(Q T ). Block (i,j) of H k - H k is 

1 T 

7p ^{Vi+u-1 - m Y +.i){y- j+ u ~ m Y --j) T - E[y yl( i+i) ] 
«=i 

I T 

= ^{Vi+u-lV-j+u ~ E boy-(i+j)]} + (C^ - m Y + ;i )(Cir - m Y - ;j ) T , 

u=l 

where m Y +. i is the ith block of m Y +, etc. The sum on the right-hand side 
agrees with that of (7.1), with s = — (i + j) + 1, except for 2(i — 1) terms with 
the lowest and highest indices. This remainder term is thus 0(k/T), which 
in turn is 0{Qt)- The whole sum is therefore 0{Qt)- We note in passing 
that the maximal (in absolute value) s considered this way is — 2k + 1, while 
\s\ < k is required in Theorem 1; however, modifying the range of s by a 
constant multiple (here 2) does not affect the validity of the theorem. 

Regarding the second term on the right-hand side, each factor is, apart 
from a remainder term of order O(kfT) as above, ©((loglogT/T) 1 / 2 ) by a 
law of iterated logarithm for strongly mixing processes [23], Theorem 5. The 
whole second term is therefore 0(loglogT/T) + 0(k 2 /T 2 ), which again is 
readily checked to be 0{Qt)- Thus, each block of Hk — TCk is 0{Qt) and we 
obtain, for example, by bounding \\Hk — ~Hk\\ by its Frobenius norm, that 
\\T~Ck ~ 7~(-k\\ = kO{Qx)- In a similar fashion, we find \\T k — T^W = kO(QT)- 
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Corollary 8.1.6 of [13], page 396, shows that ||Aj(Tfc) — Ai(Tfc)|| < \\Tk — 
Tk\\ , where Aj(-) is the ith largest singular value of a matrix. We note that 
and Tk are positive semi-definite, and hence their respective singular values 
and eigenvalues coincide. As noted in the previous section, both and 
have the eigenvalue zero with multiplicity (at least) k. We also saw that the 
multiplicity is exactly k for and that its smallest nonzero eigenvalue is 
bounded away from zero in k. If kO{Qx) — ► as T — > oo, we can thus draw 
the same conclusions about from the above inequality. 

For T large enough that the eigenvalue zero of has multiplicity k, write 

llfl -rjll = ||ftr fe rt - ftf fc rl|| < ||ft||||f fc - r fc ||||r|||, 

where the first equality is checked as in the previous section. Here, ||rt|| 
equals the reciprocal of the smallest nonzero singular value of I" 1 *., whence 
\\^\.\\ and \\^\.\\ are both bounded as T — > oo, and the middle factor is 
kO(Qr)- We thus obtain 

Wfa -M = - n h Y\\\ < \\n k - n k \\\\rl\\ + ||w fc ||||ft - r{ \\ 

= kO{Q T )0{l) + 0{l)kO{Q T ) = kO{Q T )- 

\\Hk\\ is bounded in k since the norm of its block ^[yt+iVt-j+i] t en ds 
to zero geometrically fast as i + j — > oo. The proof is complete. □ 

The next result tells us how well (3k approximates Ofe/Cfc. 

Proposition 2. There is a p£ (0, 1) such that \\/3k - Cfc^fcH = 0(p k ). 

Proof. Put K k = (K, {A - KC)K, (A - KC) 2 K, ...,(A- KCf^K), 
put Gk = OkKk an d G = OfC; notice that G = (3. Throughout the proof, we 
will use matrix subindices in parentheses to denote a "block index," where 
the block size is I: 1^) is the identity matrix with k blocks (that is, of size 
kl x k£), T 

(i:fc,fc+i:oo) is the submatrix of T formed by block rows of indices 
1 through k and block columns of indices k + 1 and upwards, etc. 
Notice that, since 

[flk — Gk,0(kxoo)] - [0(fc X fc)i G'( 1:fe]fc+1:00 )] 

= [Pk — Gk,—G(l:k,k+l:oD)] = [Afc) O(jfcXoo)] — C(l:fc,l:oo) 

= [/3fc,0( fcxoo) ] - [I( k ),0(kxoo)]G = [/3fc,0( fcxoo )] - [/(fc),0( fcxoo )]W i ' 

= [Wfcrl,0( fexoo )] - W( 1:fcjl . 00 )r' i '. 

Post-multiplying the right-hand side by r (1:00)1 . fe ) = T[I {k) , {fcxoo) ] T yields 

HkT\v k - W(i : fe i i : oo)r t r[/( fc ),0( A . XOO )] T = Hk- W(i :fcj i :OO )[/( fc ),0( A . XOO )] T 

= Tik - H-k = 0(fc X fc); 
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these equalities are true since, as follows from Section 6, T k for instance 
works as a proper inverse on the space (Sj~) k on which V and rt operate. 
We thus find the following sequence of equalities: 

[Pk — Gk,0(kxoo)]F(l:oo,l:k) = [Q(kxk)>G(l:k,k+l:<x)W(l:oo,l:k)> 
(Pk — G k )T k = G(l:k,k+l:oo)F(k+l:oo,V.k), 

Pk — G k = 0fc/C(j fc+1:oo )r( fc+1 . OO)1:fc )r jfc . 

The squared norm of O k equals the largest eigenvalue of the nX n matrix 
OjO k = Y,']Zo(A t ) T C T CA\ Because A is stable this sum is convergent, and 
we conclude that ||0fc|| is bounded in k. Moreover, \\T k \\ is bounded in k, 

and ||r( fc+1:00jl:fc ) || is bounded in k since A^SA? tends to zero geometrically 
fast as i A j — > oo. Recalling (Section 6) that the blocks of fC tend to zero 
geometrically fast, we conclude that so does ||£(fc+i :0 o)ll- Hence, \\Pk — Gk\\ = 
0(p k ) for some p G (0, 1), and the proof is complete. □ 

7.3. Consistency of parameter estimates. 

Theorem 2. For the algorithm of Section 5 with truncation index k = 
kx cls in Theorem 1 and satisfying k-x — > oo and k^Qr — > as T — > oo, there 
are nonsingular random matrices {St}, with \\St\\ and |[ jS^ 1 1| bounded a.s., 
such that \\A- S T ASt 1 \\=o{1), \\C-CS^ l \\ =o(l) and \\K- S T K\\ = o(l). 

Proof. The proof consists of several parts. First, we show that Ofc/Cfc 
is close to Ofc/Cfc. Then we show that the estimated centred state sequence 
{xt} is, up to a change of basis, close to the true centered (predicted) state 
sequence {xt}. Then we show a similar statement about the noncentered 
estimated and true state sequences, and finally, we show that the estimates 
A, G and K are consistent in the sense stated above. 

Claim 1. \\6 k )C k - O k KL k \\ = kO(Q T ) + 0{p~ k ) for some p G (0, 1). 
To prove this claim, note that O k fC k = P k — 112^-22^2^ an d thus 

\\6 k K k - o k K k \\ < \\p k - eyCfcli + ||t/ 2 A 22 y 2 T ||. 

The norm of C/ 2 A 22 V 2 T equals its largest singular value, which is the nth 
singular value of UAV T = p k . Since O k K. k has rank at most n — 1, its nth 
singular value is zero, whence, again using Corollary 8.1.6 of [13], page 396, 

||C/ 2 A 22 F 2 T || = a n (p k ) = \a n (p k ) - a n (O k JC k )\ < \0 k - O k )C k \\. 
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We conclude that 

\\6 k K k - O k K k \\ < 2\\p k - O k K k \\ < 2\\p k - k \\ + 2\\O k K k - p k \\, 
and Claim 1 follows by Propositions 1 and 2. 

Claim 2. With S T = dlO k U s ±, we have HT" 1 Y,I=i(STX t -%)(S T x t - 
^) T H =o(l). 

First recall that Il s ± is the projection matrix onto S^. Since xt G we 
thus can and will, within the proof of this claim, take St = O k O k . 

In order to prove the claim, notice that since O k has orthonormal columns, 
d\ = 6J, 6\6 k = / n _i, and hence ~S T x t -% = 6\{O k x t - O k %). Moreover, 

O k x t - 6 k % = O k [K k , (A - KC) k K}% - 6 k Z k (yt(k) - l k ® m Y ) 

= {O k K k - O k IC k )y;(k) + O k (A - KC) k lCy^_ k 

+ O k iC k (l k <g> my - l fc ® (Ctt)) = z\ + z 2 t + z\ 

say. Using the orthonormality of O k , the norm of interest is bounded by 



3 


T-l 




T-l 


1/2 


T-l 


E 


E ^ 


+ 2E 


E ^ 




E z i z t T 




t=0 




t=0 




t=0 



where we used the Cauchy-Schwarz inequality for the cross products. First, 
\\T _1 J2t=o z t z t T \\ i s bounded by 

\\O k K k - 6 k K, k \\ 2 \\t fc || = [{kO(Q T )f + 0{p~ 2k )]0{l) = o{l) 

by Claim 1 and the assumptions of the theorem. Here, T k is a sample co- 
variance matrix similar to T k but with observations centred using the true 
(unknown) mean; HT^H is bounded in k because so is ||rfc|| (see Section 6) 
and, similar to the proof of Proposition 1, \\T k — T k \\ = o(l). 

Next, we recall that (A — KC) 3 K = J^K where J is stable. Thus, z 2 = 
O k J k [K,JK,J 2 K,...]y~_ k , and hence both ||z t 2 || and HT" 1 J2l z 2 z 2T \\ tend 
to zero geometrically fast in k. 

Finally, zf does not depend on t, whence \\T~ l J2t=o z t z f T \\ * s bounded 

by 

\\d k iC k \\ 2 \\l k ® m Y - lfc ® (Cvr)H 2 = \\d k jC k fk\\m Y - Ctt\\ 2 . 

Here, Claim 1 shows that the first factor is bounded and, as in the proof 
of Proposition 1, a law of iterated logarithm ensures that the last factor is 
O (log log T/T). It can be checked that fcO(loglogT/T) =o(l), so it follows 
that the left-hand side tends to zero. 
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The proof of Claim 2 is complete. In fact, the argument contains a small 
error, in that xt does not equal fCk{y^ (t) — lk ® my) for t < k, as y s for s < 
are not observed. Adjusting for this error gives extra sums with k terms, and 
thus remainder terms of order 0(k/T). These are however of smaller-order 
than the main terms above. 

Claim 3. With 



St = 



St 0„_i 






ol i - 




L -"-n J 



it holds that \\T 1 YltLi(STXt — %t){ST%t — %) T \\ = 

This claim follows easily upon observing that St is a composition of two 
mappings acting as follows. Since l^Xt = 1, the first matrix maps x t into 
a vector in R™ +1 whose first n components are xt = xt — vr, and whose 
(n + l)st component is 1. The second mapping maps the first n components 
xt into STXt € R n ~ 1 and adjoins the last component 1. The output of the 
concatenated mappings is thus [(5t^) t ,1] t - Since the last component of 
xt is 1 by construction, the difference Sr^i — %t equals SxXt — xt, with a 
final component adjoined. The claim now follows from Claim 2. 

Claim 4. The (n— 1) xn matrices St are such that their singular values 
are bounded away from infinity and zero as T — > oo . 

First, we note that, since Ok has orthonormal columns, = 1. Like- 

wise, ||n s _i_|| = 1, so that IIStII < ll^fcll- Here, is bounded in k (see the 

proof of Proposition 2), whence ||<St|| is bounded from above. 

We must now show that the (n — l)st singular value of St is bounded 
away from zero as T — > oo. Following [10], page 1870, we first note that 
since the dimension of OkKk grows with k, it holds that A n _i(Ofc + i/Cfc + i) > 
A n _i(Ofc/Cfc) for all k. Thus, because OK, has rank n — 1, A n _i(C/ c /C/ c ) is 
bounded away from zero for k sufficiently large. Using that \ n -i{f3k) > 
Xn^iOklCk) - \\Pk ~ O k K k \\ (Corollary 8.1.6 of [13], page 396), we find by 
invoking Propositions 1 and 2 that A n _i(/3fe) is bounded away from zero as 
well. Now A n _i(/3fc) = A n _i(An), and since V\ has orthonormal columns, the 
singular values of K-k = AnV^ and An agree. Thus, A n _i(A^fc) is bounded 
away from zero. 

To complete the proof of Claim 4, let u T be the (n — l)st left singular 
vector of St = 0\Ok with unit length, that is, u T St = A n _i(5r)u T - Then 

\n-i(ST)\\KkK,\\\ > \udlOkU si iCkicl\ = \ u + udlEjcl\ 
>\ u \-\udlEjcl\>i-\\ol\\\\E\\\\icl\\, 
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where E = Otllci/Ci, — ObfCh- In addition Ila±/Cfc = fCu because I n = IIa_L + 

n n n 

Ils n and 1^ is a left eigenvector of /C^ with eigenvalue (cf. Section 6), so 
that E = Ofe/Cfe — Ofc/Cfc. Since \\0' k \\ = 1, \\JC' k \\ is bounded and \\E\\ — > 0, we 
find that A n _i(5'r) is bounded away from zero and Claim 4 is proved. 

Claim 5. The nx n matrices St are such that their singular values are 
bounded away from infinity and zero as T — > oo . 

The squared singular values of St equal the eigenvalues of the n x n 
matrix 

StSt = (In - l n K T )S] : ST{In ~ 7rlJ) + ln^n • 

The smallest eigenvalue of this matrix is the minimum of x T (Sj> St)x over 
x G R n such that \x\ = 1. Pick x G R n with \x\ = 1, and represent this vector 
(uniquely) as x = x\ + x 2 , where x\ G S„ and X2 G S^; also, |xi| 2 + | 1 2 = 1- 
Furthermore, write x\ = as where s = \ n /y/n, so that |xi| = a. We find that 
(I n — 7rlJ)xi = aw, where w = (I n — vrlj)s G S^. Moreover, (/„ — 7rl^)x 2 = 
X2- Thus, 

(7.2) x T (S^S T )x = a 2 s T l n l n ~s + {aw + x 2 ) T (S^ S T )(aw + x 2 ). 

Consider St as a linear mapping from S^" to R n_1 . Then there is a linear 
mapping St from R n_1 to S^;, called the adjoint of St, which satisfies and 
is defined by the relation (Stu,v) = (u,S T v) for all m£S^ and v G R n_1 
(e.g., [14], page 216); here (•,•) is the inner product of a linear space. It is 

straightforward to check that S T is represented by the matrix S J , and thus 

St St represents the quadratic form S t St acting on S^;. This quadratic 
form has n — 1 eigenvalues given by the squared n — 1 singular values of the 
matrix St mapping S^; to R n_1 . By Claim 4, we know that these singular 
values are uniformly (over T) bounded from below by some A > 0, so that 
the eigenvalues of S t St are bounded from below by A 2 . We conclude that 

(7.2) is bounded from below by 

(7.3) no 2 + X 2 \aw + x 2 \ 2 > (n A A 2 )(a 2 + \aw + x 2 \ 2 ), 

where n A A 2 > 0. Here, the factor a 2 + \aw + x 2 \ 2 is bounded away from zero 
because either a = \x±\ is not close to zero, or otherwise x 2 £ S^, of squared 
length \x 2 \ = 1 — a 2 , cannot closely approximate a small vector aw G S^. 
We now make this argument precise. If w = 0, the right-hand side of (7.3), 
apart from the factor n A A 2 , equals |xi| 2 + \x 2 \ 2 = \x\ 2 = 1 > 0. Now, assume 
w ^ 0. Then for small a = \x\\, say small enough that o 2 |tu| 2 < (1 — a 2 )/16 = 
|x2| 2 /16, the right-hand side of (7.3), again apart from the first factor, is 
bounded from below by 

a 2 + a 2 \w\ 2 + \x 2 \ 2 - 2a\w\\x 2 \ > a 2 + 1 - a 2 - 2 1 " > ^ > 0. 
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If a is larger and hence satisfies the reverse inequality, i.e., a 2 > l/(16|u>| 2 + 
1) > 0, the right-hand side of (7.3) is, again apart from the first factor, 
bounded from below by the same number. Thus, we have shown that (7.2) 
is uniformly bounded from below over \x\ = 1, and Claim 5 follows. 

We finally turn to proving consistency of the estimated system matrices, 
up to a similarity transformation. The estimates are 

A 

C 



K 

with e m = y t+ i - Cx t . 
Consider 

1 

if Y Vt+\*t ~ ®[yt+ixJ]Sj 

1 T-l / 1 T-l \ 

= — Y yt+^t - s Tx t ) T + \7pY^ yt+i x t - Hvt+ixJ] sj. 

t=0 \ t=Q / 

By Claims 3 and 5 and the ergodic theorem, the norm of the right-hand 
side of this display is o(l). Similarly, one can show that T~ l Y^t=\XtxJ — 
S T E[x t xJ]S^ = o(l) and T" 1 J^=i x t +ixj - S T E[x t+ ixJ]S^ = o(l). Thus 

C = (E[y t+1 xJ]S^ + o(l)){S T E[x t xJ]S^ + o(l)) f 
= Eiyt+^JjEixtxJ}-^^ 1 + o(l) = CS? 1 + o(l), 

where the last equality follows from (3.3) and (3.4). With completely corre- 
sponding derivations, we can also prove the consistency of A. 

For analyzing K, we first need the result HT" 1 J2t=i( £ t+i — £i+i)( £ t+i ~~ 
£j + i) T || = o(l), which can be proved using the decomposition et+i — Et+i = 
{Cxt — CSxxt) — (Cxt — CSxXt) and then proceeding as in Claim 2, using this 
claim as well as the consistency of C. One can then establish the consistency 
of K similarly as for C and A, given the following observations. First, writing 
xt = Xi — xt for the linear prediction error of , which is uncorrelated 
with xt, this together with (3.4) yields Efei+ie^] = R + CVC T , where 
V is the covariance matrix of xt (cf. [3]). The matrix V has, just as R, an 
eigenvalue with eigenvector 1^. Thus, so has E[et+i£'^ 1 ] as well, and under 
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Condition C this eigenvalue is simple. Second, because K = AVC (R + 
CyC T ) t ([3], equation (13)), the rows of K are in Sf . Combining these two 
facts, we find that 

E[x t+1 eJ +1 }(E[e t+1 eJ +1 ])^ = KE[s t+1 eJ +1 ](E[e t+1 eJ +1 ]Y = K, 

because the product E[ei + ie^L 1 ](E[et + ie^ fl ])^ is the identity as a mapping 
on S^" (cf. [3], Lemma 1). Third, the row vector lJC estimates the coeffi- 
cients of the optimal linear prediction of lJ yt+i from Xf. Since \Jyt+i = 1 
and the last coordinate of x t is 1, we have lJC = [Oj„ 1 l]. Thus, ljet+i = 
lj (yt+i — Cx t ) = 0, i.e. £t+i € Sj . Therefore, the convergence 

, T-l , T-l 

t=0 t=0 

implies that the difference of the corresponding pseudo-inverses is o(l), as 
both sums have a simple eigenvalue with the same eigenvector \g. □ 

Our next result shows that linear predictors built from the matrices A, 
C and K, are consistent without any change of basis, in the sense that 
they converge to optimal linear predictors given the true system matrices. 
Consider an observed history y = (yo, y~i,y~2, • • •)• It follows from (4.1) and 
(4.2) that the optimal linear predictor of y m , that is, the function </>Jjf (y) 
minimising E\y m — h(y)\ 2 over all linear functions h of y, is 

oo 

(7.4) ^f(v) = CA™" 1 J> - KCYiy-j - E[y }) + E[y j. 

3=0 

In practice, one of course computes the sum above, which equals xq, recur- 
sively as additional observations become available. Since y m has binary ele- 
ments, <p^f(y) approximates the vector of conditional probabilities F(y m = 
• | y). These conditional probabilities, which govern the optimal predictor 
0771* (y) minimizing E\y m — h(y)\ 2 over all functions h of y, are computed as 
CA m ~ 1 F(xi I = • | y) where the conditional probabilities P(xj = • | y) are 
essentially given by the filter (the forward pass of the forward-backward 
algorithm for HMMs); see, for example, [8], page 56. 

Theorem 3. Let z = (zq,z-i,z-2,.-.) be a sequence in {di,d,2, ... ,de}, 
and consider the mapping 

oo 

(7.5) fez) = OA™" 1 £(I - Kdy{z-j - E[y ]) + E[y j. 

j=0 

Then, given the assumptions of Theorem 2, this mapping converges a.s. to 
the optimal linear m-step predictor for the system (A,C), in the sense that 
(7.5) converges a.s. to (7.4), with Zj instead of yj, as T — > oo, uniformly 
over all sequences z. 
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Before giving the proof, we remark that we can replace the mean E[yo] 
by an estimate such as the sample mean T~ 1 J2o~ 1 yti which converges a.s. 
to E[yo]- We also remark that since lJC = [0,J_ 1 1] (see the final part of the 
proof of Theorem 2) and the last row of A (and hence of all A 771-1 ) equals 
[0^_]1] (see Section 5), we have lJCA m ~ 1 = [0„_]1]. Moreover, from the 
proof below it follows that lJCA m ~ 1 (A - KC) 3 K = 0. We thus conclude 
that the prediction in (7.5) is a vector with elements summing to unity, 
as they should. We conjecture, but have not been able to prove, that the 
elements of this vector are in fact in [0, 1] . Indeed, in the simulations reported 
in Section 8, not a single computed (jr™ (z) contained elements outside this 
range. 

Proof of Theorem 3. Since Halloo = 1 and (A - KC) j K = J 3 K 
where J is stable (cf. the proof of Theorem 2), we find that for any 5 > 
there is an M > such that truncating the sum in (7.4) after M terms yields 
an error less than 6. Now, using Theorem 2, one finds that the sum over the 
first M terms in (7.5) converges to the corresponding finite sum of terms in 
(7.4) (with Zj), uniformly because the Zj are uniformly bounded. 

What now remains to prove is that by choosing M large enough, the tail 
sum over terms j = M,M + 1, ... in (7.5) is at most 5, uniformly over z. 
This follows as the product (A — KC) 3 K has a structure similar to that 
of (A — KC) 3 K. First, notice that since the last coordinate of $t+i is 1, 
the last row of the sum T -1 J2o~ l ^t+i^t+i i n the expression for K will be 
the sum of the sj + i- This sum is zero however, because of the definition of 
these variables as residuals in a linear regression. Thus, the last row of K is 
zero, and so is the last row of KC. Moreover, the last row of A is [0j_ 1; 1] 
(see the end of Section 5), a vector that we denote by 1^. Therefore, the 
last row of A — KC is 1^. This implies that A — KC has an eigenvalue 
1 with left eigenvector 1„. Denote the corresponding right eigenvector by 
7, normalized such that 1J7 = 1. Then (A — KC) 3 = J 3 + 7IJ say. Since 
A — KC converges to A — KC up to a similarity transformation, the matrix 
J will be stable for large T, with eigenvalues converging to those of J. Thus, 
the size of the tail sum in (7.5) is of the order Sp(J) , where Sp is the 
spectral radius, and we conclude that this tail sum will be less than 5 for 
large enough T. As 5 > was arbitrary, the proof is complete. □ 

8. Examples. We consider three examples, defined by the matrices 
Mo.l O.9J' Mo.l 0.9 j' ^0.05 0.6 0.05 j 
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and 

«-(*;£)■ ess ss). 

respectively. For (Ai,Ci), the Markov chain is rather inert, and the output 
symbol is informative about the corresponding Markov state. For (A2,Cz), 
the dynamics of the hidden chain is as before, but the output symbols are 
much less informative about the current state. For (^3,63), states e\ and 
es are quite inert, while visits to state e2 are short. The output is quite 
informative about the current state, although not as informative as for C\. 

We simulated data according to these systems, and the subspace estima- 
tion algorithm was run for increasing values of T and k, respectively. For 
each system, we computed estimates A, C and K as in (5.1) for 250 repli- 
cations of data. Then we computed the 1-step predictive distribution <p{ n 
for each time-point in an independent series of 5000 observations simulated 
from the same system. For each time-point t, we computed the ^i-norm 
|^\ in — </>i in |i comparing the estimated optimal linear predictive distribution 
to the true ditto, and also the £i-norm |^\ in — </>° pt |i comparing the same es- 
timate to the optimal predictive distribution. These ^i-norms are also total 
variation distances between the respective distributions. 

Table 1 reports averages of these £i-norms over data replications and 
time-points for predictions. We see that the differences between ^\ in and 0\ in 
tend to zero as T (and k) increases, which is in accordance with Theorem 
3. The differences to (^>° p do not vanish in the limit however, reflecting that 
for each system, <p\ n and (/>° pt are not equal. The limits of these differences 
are hence dictated by the systems (A, C) themselves. 

Table 2 examines the effect of the truncation index k, for data of length 
T = 20,000 from the system (A\,Ci). The subspace algorithm is reassuringly 
robust, in that k has a negligible impact on its prediction performance, 
provided k > 5 say. 

9. Discussion. The present paper is a first step towards a rigorous un- 
derstanding of the use of subspace methods for estimation of and prediction 
with HMMs. However, many important problems remain, the most obvious 
one being that of obtaining a positive realization of the system matrices, 
that is, finding a similarity transformation that provides estimates of A and 
C with positive entries summing to unity columnwise. 

The algorithm outlined in [20], Section VI. D, Step 2, based on feasible 
directions, finds a positive realization for a problem that is related to ours, 
but with a cost function that is not directly applicable to the present prob- 
lem. Presumably, the algorithm could however be modified to work for the 
present setting. The algorithms in [9, 24] find a nonnegative matrix with a 
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Table 1 

Predictive distribution errors for the 1-step ahead linear predictor (7.5), with E[j/o] 
replaced by sample means, over 250 replications of simulated data from the systems 
(Ai, Ci)-(A 3 , d) and 5000 out-of-sample observations. The figures are empirical 
averages of the li-norms l^ 111 — <^i m |i and (A 1 ] 111 — 0° pt |i over the 250 replications (for 
which the estimates A, etc., vary) and over the 5000 out-of-sample observations (for 
which the observed history varies) 



T 


fe 




Mean predictive distribution 


error 




w.r.t. 


optimal linear pred. 


w.r.t. optimal pred. 


(Ai.Ci) 


(A 2 ,C 2 ) 


(A 3 ,C 3 ) 


(Ai,Ci) 


(Aa,C a ) 


(A 3 ,C 3 ) 


1000 


5 


0.0299 


0.0381 


0.0676 


0.0612 


0.0384 


0.1155 


5000 


8 


0.0138 


0.0173 


0.0301 


0.0564 


0.0177 


0.0992 


10,000 


12 


0.0102 


0.0130 


0.0220 


0.0562 


0.0134 


0.0974 


20,000 


16 


0.0079 


0.0095 


0.0164 


0.0560 


0.0099 


0.0964 


40,000 


20 


0.0060 


0.0070 


0.0117 


0.0559 


0.0075 


0.0958 



given spectrum (set of eigenvalues), with one of them handling constraints 
stating that some matrix elements should be zero [9], page 1030. They do, 
however, not take the matrix C into account, whence also these algorithms 
need adjustments before being suitable for the present problem. It should 
also be stressed that all of the above algorithms are local search algorithms, 
and may hence converge to a point that is not an optimal solution, or is 
not a solution at all. Therefore, they might need to be restarted at different 
initial points. That the algorithms converge only locally is not surprising as 

Table 2 

Results as in Table 1 for the system (A\,Ci) and various k 



Mean predictive distribution error 



T 


k 


w.r.t. optimal linear pred. 


w.r.t. optimal pred. 


20,000 


3 


0.0093 


0.0550 


20,000 


5 


0.0086 


0.0559 


20,000 


8 


0.0087 


0.0559 


20,000 


10 


0.0088 


0.0559 


20,000 


12 


0.0088 


0.0559 


20,000 


15 


0.0088 


0.0559 


20,000 


20 


0.0088 


0.0559 


20,000 


25 


0.0088 


0.0559 


20,000 


30 


0.0089 


0.0559 


20,000 


40 


0.0089 


0.0559 


20,000 


50 


0.0091 


0.0558 
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the underlying problem that they try to solve is nonconvex, but of course it 
removes some of the appeal of the subspace algorithm. 

Other further aspects of the subspace approach are the asymptotic distri- 
bution of the estimators, and how to minimize their asymptotic variance; the 
latter question also requires an understanding of subspace algorithms more 
general than the one presented here in terms of pre and postmultiplication 
of the /3f. by weighting matrices prior to the SVD, and factorization of the 
actual SVD (cf. [10]). 
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